Abstract


An increasing number of applications are concerned with recovering a
sparse matrix from noisy observations. In this paper, we consider the 
setting where each row of the unknown matrix is sparse. We establish 
minimax optimal rates of convergence for estimating matrices with row 
sparsity. A major focus in the present paper is on the derivation of lower 
bounds. 

1 Introduction 

In recent years, there has been great interest in the theory of estimation in
high-dimensional statistical models under different sparsity scenarios. The main
motivation behind sparse estimation is the observation that, in several
practical applications, the number of variables is much larger than the number
of observations, but the number of degrees of freedom of the underlying model is
relatively small. One example of such sparse estimation is the problem of estimating a
sparse regression vector from a set of linear measurements (see, e.g., [2], [5], [16],
[23]). Another example is the problem of matrix recovery under the assumption
that the unknown matrix has low rank (see, e.g., [8, 20, 14, 15]).

In some recent papers dealing with covariance matrix estimation, a different
notion of sparsity was considered (see, for example, [7], [19]). This notion is
based on sparsity assumptions on the rows (or columns) $M_{i\cdot}$ of the matrix $M$.
One can consider the hard sparsity assumption, meaning that each row $M_{i\cdot}$ of
$M$ contains at most $s$ non-zero elements, or the soft sparsity assumption, based
on imposing a certain decay rate on the ordered entries of $M_{i\cdot}$. These notions of
sparsity can be defined in terms of $\ell_q$-balls for $q \in [0,2)$, defined as

$$\mathbb{B}_q(s) = \Big\{ v = (v_i) \in \mathbb{R}^{n_2} : \sum_{i=1}^{n_2} |v_i|^q \le s \Big\} \qquad (1)$$

where $s < \infty$ is a given constant. The case $q = 0$,

$$\mathbb{B}_0(s) = \Big\{ v = (v_i) \in \mathbb{R}^{n_2} : \sum_{i=1}^{n_2} I(v_i \ne 0) \le s \Big\}, \qquad (2)$$

corresponds to the set of vectors $v$ with at most $s$ non-zero elements. Here $I(\cdot)$
denotes the indicator function and $s \ge 1$ is an integer.
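
To make the two notions concrete, membership in such a ball can be checked in a few lines. The following Python sketch is only an illustration: the function name `in_lq_ball` is ours and not part of the paper, and it assumes the usual convention that, for $0 < q < 2$, a vector belongs to the ball of radius $s$ when the sum of the $q$-th powers of the absolute values of its entries is at most $s$, while for $q = 0$ it has at most $s$ non-zero entries.

```python
def in_lq_ball(v, q, s):
    """Check whether the vector v lies in the l_q ball of radius s.

    Assumed convention (an illustration, not quoted from the paper):
      q == 0     -> at most s non-zero entries (hard sparsity),
      0 < q < 2  -> sum_i |v_i|**q <= s        (soft sparsity).
    """
    if not 0 <= q < 2:
        raise ValueError("q must lie in [0, 2)")
    if q == 0:
        # Count the non-zero coordinates.
        return sum(1 for x in v if x != 0) <= s
    # Sum of q-th powers of the absolute values.
    return sum(abs(x) ** q for x in v) <= s
```

For instance, `in_lq_ball([0, 3, 0, -1], 0, 2)` holds because only two entries are non-zero, whereas `in_lq_ball([1, 1], 0.5, 1.5)` fails because the sum of the square roots equals 2.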

In the present note, we consider this row sparsity setting in the matrix signal
plus noise model. Suppose we have noisy observations $Y = (y_{ij})$ of an $n_1 \times n_2$
matrix $M = (m_{ij})$ where

$$y_{ij} = m_{ij} + \xi_{ij}, \qquad i = 1,\dots,n_1, \quad j = 1,\dots,n_2; \qquad (3)$$

here, $\xi_{ij}$ are i.i.d. Gaussian $\mathcal{N}(0,\sigma^2)$, $\sigma^2 > 0$, or sub-Gaussian random variables.

We denote by $E = (\xi_{ij})$ the corresponding matrix of noise. We study the
minimax optimal rates of convergence for the estimation of $M$ assuming that
there exist $q \in [0,2)$ and $s$ such that $M_{i\cdot} \in \mathbb{B}_q(s)$ for any $i = 1,\dots,n_1$.

The minimax rate of convergence characterizes the fundamental limitation
of the estimation accuracy. It also captures the interdependence between the
different parameters in the model. There is a rich line of work on such
fundamental limits (see, for example, [13, 21, 11]). The minimax risk depends
crucially on the choice of the norm in the loss function. In the present paper,
we measure the estimation error in the $\|\cdot\|_{2,p}$-(quasi)norm for $0 < p < \infty$ (for the
definition see (4)).

For $n_1 = 1$, we obtain the problem of estimating a vector belonging to a
$\mathbb{B}_q(s)$ ball in $\mathbb{R}^{n_2}$. This problem was considered in a number of papers, see, for
example, [9], [3], [1], [17]. Let $\eta_{\rm vect}$ denote the minimax rate of convergence with
respect to the squared Euclidean norm in the vector case. It is interesting to note
that the results of the present paper show that, for the case $p = 2$, the minimax
rate of convergence for estimation of matrices under the row sparsity assumption
is $n_1 \eta_{\rm vect}$. Thus, in this case, the problem reduces to estimation of each row
separately. The additional matrix structure leads neither to improvement nor to
deterioration of the rate of convergence. We show that this is also true for general
$p$.

A major focus in the present paper is on the derivation of lower bounds, which is
a key step in establishing minimax optimal rates of convergence. Our analysis is
based on a new selection lemma (Lemma 1). The rest of the paper is organized
as follows. In Section 1.1, we introduce the notation and some basic tools used
throughout the paper. Section 2 establishes the minimax lower bounds for
estimation of matrices with row sparsity in the $\|\cdot\|_{2,p}$-norm, see Theorems 1 and 2.
In Section 3, we derive the upper bounds on the risks using a reduction to the
vector case. Most of the proofs are given in the appendix.

1.1 Definitions and notation 

Let $A$ be a matrix or a vector. For $0 < q < \infty$ and $A \in \mathbb{R}^{n_1 \times n_2}$, $A = (a_{ij})$, we
denote by $\|A\|_q = \big( \sum_{i,j} |a_{ij}|^q \big)^{1/q}$ the elementwise $\ell_q$-(quasi-)norm of $A$, and
by $\|A\|_0$ the number of non-zero coefficients of $A$:

$$\|A\|_0 = \sum_{i,j} I(a_{ij} \ne 0),$$

where $I(\cdot)$ denotes the indicator function. For any $A = (A_{1\cdot},\dots,A_{n_1\cdot})^{\top} \in \mathbb{R}^{n_1 \times n_2}$
and $p > 0$ define

$$\|A\|_{2,p} = \Big( \sum_{i=1}^{n_1} \|A_{i\cdot}\|_2^{p} \Big)^{1/p}. \qquad (4)$$

For $p = 2$, $\|A\|_{2,2}$ is the elementwise $\ell_2$-norm of $A$ and we will use the notation
$\|\cdot\|_{2,2} = \|\cdot\|_2$. For $0 < p \le 1$, we have the following inequality

$$\|A + A'\|_{2,p}^{p} \le \|A\|_{2,p}^{p} + \|A'\|_{2,p}^{p}.$$

For $q \in [0,2)$ and $s > 0$ we define the following class of matrices

$$\mathcal{A}(q,s) = \{ A \in \mathbb{R}^{n_1 \times n_2} : A_{i\cdot} \in \mathbb{B}_q(s) \text{ for any } i = 1,\dots,n_1 \}. \qquad (5)$$

In the limiting case $q = 0$, we will also write

$$\mathcal{A}(s) = \{ A \in \mathbb{R}^{n_1 \times n_2} : A_{i\cdot} \in \mathbb{B}_0(s) \text{ for any } i = 1,\dots,n_1 \}. \qquad (6)$$

We set $\mathcal{N}_{n_1 \times n_2} = \{ (i,j) : 1 \le i \le n_1,\ 1 \le j \le n_2 \}$. For two real numbers $a$ and
$b$ we use the notation $a \wedge b := \min(a,b)$, $a \vee b := \max(a,b)$; we denote by
$\lfloor x \rfloor$ the integer part of $x$; we use the symbol $C$ for a generic positive constant,
which is independent of $n_1$, $n_2$, $s$ and $\sigma$ and may take different values at different
appearances.
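
The quantities introduced in this subsection are straightforward to compute. The sketch below, in plain Python with hypothetical function names of our own, evaluates the $\|\cdot\|_{2,p}$ (quasi-)norm of a matrix stored as a list of rows, under the assumption that (4) is the $\ell_p$ (quasi-)norm of the vector of row-wise Euclidean norms, and checks membership in the row-sparse class row by row.

```python
import math

def row_l2(row):
    """Euclidean norm of one row."""
    return math.sqrt(sum(x * x for x in row))

def norm_2p(A, p):
    """(2,p) (quasi-)norm: l_p (quasi-)norm of the vector of row l_2 norms."""
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(row_l2(row) ** p for row in A) ** (1.0 / p)

def in_row_sparse_class(A, q, s):
    """True if every row of A lies in the l_q ball of radius s (0 <= q < 2)."""
    for row in A:
        if q == 0:
            size = sum(1 for x in row if x != 0)   # number of non-zeros
        else:
            size = sum(abs(x) ** q for x in row)   # sum of q-th powers
        if size > s:
            return False
    return True
```

With rows of Euclidean norms 5 and 10, for example, `norm_2p([[3.0, 4.0], [6.0, 8.0]], 1)` is the sum of the two row norms.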




2 Lower bounds 


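We start by establishing the minimax lower bounds for estimation of matrices
over the classes $\mathcal{A}(s)$ (Theorem 1) and $\mathcal{A}(q,s)$ (Theorem 2). We denote by $\inf_{\hat A}$
the infimum over all estimators $\hat A$ with values in $\mathbb{R}^{n_1 \times n_2}$. Consider first the case
$q = 0$.

Theorem 1. Let $n_1, n_2 \ge 2$ and $p > 0$. Fix an integer $1 \le s \le n_2/2$. Assume
that for $(i,j) \in \mathcal{N}_{n_1 \times n_2}$ the noise variables $\xi_{ij}$ are i.i.d. Gaussian $\mathcal{N}(0,\sigma^2)$, $\sigma^2 > 0$.
Then,

(i)
$$\inf_{\hat A} \sup_{A \in \mathcal{A}(s)} \mathbb{P}_A \Big\{ \|\hat A - A\|_{2,p}^{2} \ge C \sigma^2 (n_1)^{2/p} s \log\Big(\frac{e n_2}{s}\Big) \Big\} \ge \beta;$$

(ii)
$$\inf_{\hat A} \sup_{A \in \mathcal{A}(s)} \mathbb{E}\, \|\hat A - A\|_{2,p}^{2} \ge \tilde C \sigma^2 (n_1)^{2/p} s \log\Big(\frac{e n_2}{s}\Big),$$

where $0 < \beta < 1$, $C > 0$, and $\tilde C > 0$ are absolute constants.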
Proof. It is enough to prove (i) since (ii) follows from (i) and the Markov inequality.
For $A \in \mathbb{R}^{n_1 \times n_2}$ we denote by $\mathbb{P}_A$ the probability distribution of the $\mathcal{N}(A, \sigma^2 I)$
Gaussian random vector, where $I$ denotes the $(n_1 n_2) \times (n_1 n_2)$ identity matrix. We
denote by $KL(P,Q)$ the Kullback-Leibler divergence between the probability
measures $P$ and $Q$.

To prove (i) we use Theorem 2.5 in [21]. It is enough to check that there
exists a finite subset $\Omega'$ of $\mathcal{A}(s)$ such that for any two distinct $B, B'$ in $\Omega'$ we
have

(a) $\|B - B'\|_{2,p}^{2} \ge C \sigma^2 (n_1)^{2/p} s \log\big(\frac{e n_2}{s}\big)$,

(b) $KL(\mathbb{P}_B, \mathbb{P}_{B'}) \le \alpha \log(\mathrm{card}\, \Omega')$

for some constants $C > 0$ and $0 < \alpha < 1/8$.

Denote by $\{0,1\}^{n_1 \times n_2}_{s}$ the set of all matrices $A = (a_{ij}) \in \mathbb{R}^{n_1 \times n_2}$ such that
$a_{ij} \in \{0,1\}$ and each row of $A$ contains exactly $s$ ones. For any two matrices
$A = (a_{ij})$ and $A' = (a'_{ij})$ in $\{0,1\}^{n_1 \times n_2}_{s}$ define the Hamming distance

$$d_H(A,A') = \sum_{(i,j) \in \mathcal{N}_{n_1 \times n_2}} I(a_{ij} \ne a'_{ij}).$$


We use the following selection lemma, proved in Appendix A.

Lemma 1. Let $n_1, n_2 \ge 2$ and $1 \le s \le n_2/2$. Then, there exists a subset $\Omega$ of
$\{0,1\}^{n_1 \times n_2}_{s}$ such that for some numerical constant $C \ge 10^{-5}$

$$\log(|\Omega|) \ge C n_1 s \log\Big(\frac{e n_2}{s}\Big) \qquad (7)$$

and, for any two distinct $A, A'$ in $\Omega$, the Hamming distance satisfies

$$d_H(A,A') \ge \frac{n_1 (s+1)}{16}. \qquad (8)$$

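Fix $0 < \gamma < 1$ and define

$$\Omega' = \Big\{ \sigma \gamma \sqrt{\log\Big(\frac{e n_2}{s}\Big)}\; A \;:\; A \in \Omega \Big\},$$

where $\Omega$ is a set satisfying the conditions of Lemma 1. For $p = 2$, using (8) we
obtain that for any two distinct $B, B'$ in $\Omega'$

$$\|B - B'\|_2^2 \ge \frac{\gamma^2 \sigma^2 n_1 s}{16} \log\Big(\frac{e n_2}{s}\Big).$$

This implies (a) for $p = 2$. For $p \ne 2$ we will use the following elementary
lemma, cf. Appendix B.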

Lemma 2. If $A = (a_{ij})$ and $A' = (a'_{ij})$ are two elements of $\{0,1\}^{n_1 \times n_2}_{s}$
such that $d_H(A,A') \ge \frac{n_1 s}{16}$, then the cardinality of the set

$$J(A,A') = \Big\{ 1 \le i \le n_1 : \sum_{j=1}^{n_2} I(a_{ij} \ne a'_{ij}) \ge \frac{s}{32} \Big\}$$

is greater than or equal to $\frac{n_1}{64}$.
Lemma 2 implies that for any two distinct $B, B'$ in $\Omega'$

$$\|B - B'\|_{2,p}^{2} \ge \frac{\gamma^2 \sigma^2}{64^{1+2/p}}\, n_1^{2/p}\, s \log\Big(\frac{e n_2}{s}\Big), \qquad (9)$$

which yields (a) for $p \ne 2$.

To check (b), note that $d_H(A,A') \le 2 n_1 s$ for all $A, A' \in \{0,1\}^{n_1 \times n_2}_{s}$. This
implies

$$KL(\mathbb{P}_B, \mathbb{P}_{B'}) = \frac{1}{2\sigma^2} \|B - B'\|_2^2 \le \gamma^2 n_1 s \log\Big(\frac{e n_2}{s}\Big). \qquad (10)$$

Since also $|\Omega| = |\Omega'|$, from (7) and (10) we deduce that (b) is satisfied with
$\alpha < 1/8$ if $\gamma > 0$ is chosen sufficiently small. This completes the proof of
Theorem 1. □
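
The role of the scaling factor in this construction can be checked numerically. The following sketch (plain Python; all function names are illustrative and not from the paper) assumes that the packing set is obtained by multiplying binary matrices with exactly $s$ ones per row by $\gamma \sigma \sqrt{\log(e n_2/s)}$. It draws two such matrices at random and verifies that the Gaussian Kullback-Leibler divergence $\|B - B'\|_2^2 / (2\sigma^2)$ equals the squared scale times the Hamming distance divided by $2\sigma^2$, and hence is at most $\gamma^2 n_1 s \log(e n_2/s)$. The values of $n_1$, $n_2$, $s$, $\gamma$, $\sigma$ and the random seed are arbitrary choices.

```python
import math
import random

def random_row_sparse(n1, n2, s, rng):
    """Random n1 x n2 binary matrix with exactly s ones in every row."""
    rows = []
    for _ in range(n1):
        ones = set(rng.sample(range(n2), s))
        rows.append([1 if j in ones else 0 for j in range(n2)])
    return rows

def hamming(A, B):
    """Number of entries in which the matrices A and B differ."""
    return sum(a != b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def scaled(A, c):
    """Multiply every entry of A by the scalar c."""
    return [[c * a for a in row] for row in A]

def kl_gaussian(B, Bp, sigma):
    """KL divergence between N(B, sigma^2 I) and N(Bp, sigma^2 I)."""
    sq = sum((x - y) ** 2 for rb, rbp in zip(B, Bp) for x, y in zip(rb, rbp))
    return sq / (2.0 * sigma ** 2)

# Arbitrary illustrative values (assumptions, not taken from the paper).
rng = random.Random(0)
n1, n2, s, gamma, sigma = 5, 40, 4, 0.1, 2.0
c = gamma * sigma * math.sqrt(math.log(math.e * n2 / s))

A = random_row_sparse(n1, n2, s, rng)
A_prime = random_row_sparse(n1, n2, s, rng)
d_h = hamming(A, A_prime)
kl = kl_gaussian(scaled(A, c), scaled(A_prime, c), sigma)
kl_bound = gamma ** 2 * n1 * s * math.log(math.e * n2 / s)
```

Since two rows with $s$ ones each differ in at most $2s$ positions, `d_h` never exceeds `2 * n1 * s`, which is what makes `kl` stay below `kl_bound`.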

Note that there are $\binom{n_2}{s}^{n_1}$ possible sparsity patterns which satisfy the hard
sparsity condition on the rows. By standard bounds on binomial coefficients,
we have $\log\big( \binom{n_2}{s}^{n_1} \big) \asymp n_1 s \log\big(\frac{e n_2}{s}\big)$. Consequently, the rate $n_1 s \log\big(\frac{e n_2}{s}\big)$
corresponds to the logarithm of the number of models.
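
This equivalence is easy to verify numerically. The sketch below (plain Python, with function names of our choosing) compares the exact logarithm of the number of row-sparsity patterns with $n_1 s \log(e n_2/s)$. It relies on the classical two-sided bound $(n_2/s)^s \le \binom{n_2}{s} \le (e n_2/s)^s$, which we assume is the "standard bound" meant here.

```python
import math

def log_num_patterns(n1, n2, s):
    """Exact log of the number of supports with s non-zeros in each of n1 rows."""
    return n1 * math.log(math.comb(n2, s))

def log_model_rate(n1, n2, s):
    """The quantity n1 * s * log(e * n2 / s) appearing in the lower bound."""
    return n1 * s * math.log(math.e * n2 / s)

def lower_envelope(n1, n2, s):
    """Matching lower bound n1 * s * log(n2 / s) from binom(n2, s) >= (n2/s)**s."""
    return n1 * s * math.log(n2 / s)
```

For $1 \le s \le n_2/2$ the exact count is sandwiched between `lower_envelope` and `log_model_rate`, and these two differ only by the additive term $n_1 s$.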

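Let us now turn to the soft sparsity scenario. For any $0 < q < 2$ and $s > 0$
define the quantity

$$\eta(s) = n_1 s \Big[ \sigma^2 \log\Big( 1 + \frac{\sigma^q n_2}{s} \Big) \Big]^{1 - q/2} \wedge \big( n_1 s^{2/q} \big) \wedge \big( n_1 n_2 \sigma^2 \big). \qquad (11)$$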

The minimax lower bound is given by the following theorem, proved in
Appendix C.

Theorem 2. Let $n_1, n_2 \ge 2$. Fix $0 < q < 2$ and $s > 0$. Suppose that for
$(i,j) \in \mathcal{N}_{n_1 \times n_2}$ the noise variables $\xi_{ij}$ are i.i.d. Gaussian $\mathcal{N}(0,\sigma^2)$, $\sigma^2 > 0$.
Then, there exists a numerical constant $c^*$ such that

(i)
$$\inf_{\hat A} \sup_{A \in \mathcal{A}(q,s)} \mathbb{P}_A \big\{ \|\hat A - A\|_2^2 \ge c^* \eta(s) \big\} \ge \beta,$$

where $0 < \beta < 1$, and

(ii)
$$\inf_{\hat A} \sup_{A \in \mathcal{A}(q,s)} \mathbb{E}\, \|\hat A - A\|_2^2 \ge c^* \eta(s).$$
3 Minimax rates of convergence


Consider the problem of estimating a vector $v = (v_i) \in \mathbb{B}_q(s) \subset \mathbb{R}^{n_2}$ from
noisy observations

$$y_i = v_i + \xi_i, \qquad i = 1,\dots,n_2,$$

where $\xi_i$ are i.i.d. Gaussian $\mathcal{N}(0,\sigma^2)$, $\sigma^2 > 0$.

The non-asymptotic minimax optimal rate of convergence for estimation of
$v$ in the $\ell_2$-norm, obtained in [3], is given by

$$\eta_{\rm vect}(s) = \sigma^2 s \log\Big(\frac{e n_2}{s}\Big)$$

when $q = 0$ and by

$$\eta_{\rm vect}(s) = s \Big[ \sigma^2 \log\Big( 1 + \frac{\sigma^q n_2}{s} \Big) \Big]^{1 - q/2} \wedge s^{2/q} \wedge \big( n_2 \sigma^2 \big)$$

when $0 < q < 2$.

We see that, for $p = 2$, the lower bounds given by Theorems 1 and 2 are
$n_1 \eta_{\rm vect}(s)$ in the case of hard sparsity and $n_1 \eta_{\rm vect}(s)$ in the case of soft sparsity.
We get the same rate as when estimating each row separately. This implies
that, in this particular case, the additional matrix structure does not lead to
improvement or to deterioration of the rate of convergence.
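
The relation between the vector and matrix rates can be made explicit in code. The following Python sketch uses hypothetical function names and assumes the rate formulas in the form $\sigma^2 s \log(e n_2/s)$ for $q = 0$ and $s[\sigma^2 \log(1 + \sigma^q n_2/s)]^{1-q/2} \wedge s^{2/q} \wedge (n_2 \sigma^2)$ for $0 < q < 2$. Since a minimum commutes with multiplication by the positive number $n_1$, the matrix rate for $p = 2$ is simply $n_1$ times the vector rate.

```python
import math

def eta_vect(n2, s, sigma, q):
    """Vector minimax rate over the l_q ball of radius s (assumed form, see text)."""
    if q == 0:
        return sigma ** 2 * s * math.log(math.e * n2 / s)
    if not 0 < q < 2:
        raise ValueError("q must lie in [0, 2)")
    main = s * (sigma ** 2 * math.log(1.0 + sigma ** q * n2 / s)) ** (1.0 - q / 2.0)
    # Minimum of the main term, the trivial bound s^(2/q), and the parametric bound.
    return min(main, s ** (2.0 / q), n2 * sigma ** 2)

def eta_matrix(n1, n2, s, sigma, q):
    """Matrix rate for p = 2: n1 copies of the vector problem."""
    return n1 * eta_vect(n2, s, sigma, q)
```

The three branches of the minimum correspond to the moderately sparse regime, the regime where the radius $s$ is so small that estimating by zero is already optimal, and the dense regime where sparsity brings no gain over the parametric rate.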

As shown below and in view of the lower bounds of Theorems 1 and 2, optimal
rates for arbitrary $p$ can also be obtained from vector estimation methods.
It suffices to apply to the rows of $M$ a minimax optimal method for vector
estimation on $\mathbb{B}_q(s)$ balls. One can take, for example, the following penalized least
squares estimator $\hat M$ of $M$ (cf. [3]):

$$\hat M = \arg\min_{A \in \mathbb{R}^{n_1 \times n_2}} \Big\{ \|Y - A\|_2^2 + \lambda \|A\|_0 \log\Big( \frac{e n_1 n_2}{\|A\|_0 \vee 1} \Big) \Big\} \qquad (12)$$

where $\lambda > 0$ is a regularization parameter. The penalty in (12) is inspired by
the hard thresholding penalty $\|A\|_0$, which leads to $\hat m_{ij}$ that are thresholded
values of $y_{ij}$ (see, for instance, [12], page 138).

The penalized least squares estimator defined in (12) can be computed efficiently.
Let $y_{(j)}$ denote the $j$th largest in absolute value component of $Y$. The
estimator $\hat M$ is obtained by thresholding the coefficients of $Y$: we keep $y_{(j)}$ such
that

$$y_{(j)}^2 > \lambda \Big( \log(e n_1 n_2) + \sum_{i=2}^{j} (-1)^{i+j+1}\, i \log(i) \Big)$$

and set all other coefficients equal to zero.
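
A direct way to compute the minimizer of (12) is sketched below in plain Python. It does not rely on a closed-form threshold. It uses only the fact that, for a fixed number $k$ of non-zero entries, the best candidate keeps the $k$ entries of $Y$ that are largest in absolute value, and then scans all values of $k$. The function name `penalized_ls` and the list-of-rows representation are our own choices, and the criterion is assumed to be $\|Y - A\|_2^2 + \lambda \|A\|_0 \log(e n_1 n_2 / (\|A\|_0 \vee 1))$.

```python
import math

def penalized_ls(Y, lam):
    """Minimize ||Y - A||_2^2 + lam * ||A||_0 * log(e*n1*n2 / max(||A||_0, 1)).

    Y is a list of rows; the result has the same shape.
    """
    n1, n2 = len(Y), len(Y[0])
    N = n1 * n2
    # Entries sorted by decreasing absolute value, with their positions.
    entries = sorted(((abs(Y[i][j]), i, j) for i in range(n1) for j in range(n2)),
                     reverse=True)
    total = sum(v * v for v, _, _ in entries)

    best_k, best_obj = 0, total      # k = 0: estimate by zero, penalty vanishes
    kept = 0.0
    for k in range(1, N + 1):
        kept += entries[k - 1][0] ** 2
        # Residual sum of squares of the dropped entries plus the penalty.
        obj = (total - kept) + lam * k * math.log(math.e * N / k)
        if obj < best_obj:
            best_k, best_obj = k, obj

    M_hat = [[0.0] * n2 for _ in range(n1)]
    for _, i, j in entries[:best_k]:
        M_hat[i][j] = Y[i][j]
    return M_hat
```

The cost is dominated by sorting the $n_1 n_2$ entries. For `Y = [[10.0, 0.1], [0.2, -8.0]]` and `lam = 1.0`, the scan selects the two large entries and sets the two small ones to zero.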

In what follows we assume that the noise variables $\xi_{ij}$ are zero-mean and
sub-Gaussian, which means that they satisfy the following assumption.

Assumption 1. $\mathbb{E}(\xi_{ij}) = 0$ and there exists a constant $K > 0$ such that

$$\big( \mathbb{E} |\xi_{ij}|^p \big)^{1/p} \le K \sqrt{p} \quad \text{for all } p \ge 1$$

for any $1 \le i \le n_1$ and $1 \le j \le n_2$.

This assumption on the noise variables means that their distribution is dominated
by the distribution of a centered Gaussian random variable. This class
of distributions is rather wide. Examples of sub-Gaussian random variables are
Gaussian or bounded random variables. In particular, Assumption 1 implies
that $\mathbb{E}(\xi_{ij}^2) \le 2K^2$.

The next theorem presents oracle inequalities for the penalized least squares
estimator $\hat M$, both in probability and in expectation.

Theorem 3. Let $\hat M$ be the penalized least squares estimator defined in (12), $a > 1$
and $\lambda = 2 a K_0 K^2$ where $K_0 > 0$ is large enough. Suppose that Assumption 1
holds. Then, for any $\Delta > 0$

$$\|\hat M - M\|_2^2 \le \inf_{A \in \mathbb{R}^{n_1 \times n_2}} \Big\{ \frac{a+1}{a-1} \|M - A\|_2^2 + C K^2 \|A\|_0 \log\Big( \frac{e n_1 n_2}{\|A\|_0 \vee 1} \Big) \Big\} + \frac{2a^2}{a-1} \Delta \qquad (13)$$

with probability at least $1 - 2 \exp\{ -C_0 \Delta / K^2 \}$, and

$$\mathbb{E} \|\hat M - M\|_2^2 \le \inf_{A \in \mathbb{R}^{n_1 \times n_2}} \Big\{ \frac{a+1}{a-1} \|M - A\|_2^2 + C K^2 \|A\|_0 \log\Big( \frac{e n_1 n_2}{\|A\|_0 \vee 1} \Big) \Big\} + \tilde C K^2 \qquad (14)$$

where $C$, $C_0$ and $\tilde C$ are numerical constants.

For the particular case of Gaussian noise, the result (14) of Theorem 3 is
proved in [3], and the result (13) in [4]. Theorem 3 extends the analysis to the
case of sub-Gaussian noise. The proof is given in Appendix D.

Now suppose that $M \in \mathcal{A}(s)$. Using Theorem 3 and the inequality

$$\|\hat M - M\|_{2,p} \le n_1^{1/p - 1/2} \|\hat M - M\|_2$$

that holds for any $0 < p \le 2$, we obtain the following corollary.

Corollary 1. Let $\hat M$ be the penalized least squares estimator defined in (12)
with $\lambda = K_0 K^2$ where $K_0 > 0$ is large enough. Suppose that Assumption 1
holds and that $M \in \mathcal{A}(s)$. Then, for all $0 < p \le 2$ and for any $\Delta > 0$

$$\|\hat M - M\|_{2,p}^2 \le C K^2 n_1^{2/p} s \log\Big(\frac{e n_2}{s}\Big) + \Delta \qquad (15)$$

with probability at least 1 — 2 exp { — }, and 


$$\mathbb{E} \|\hat M - M\|_{2,p}^2 \le C K^2 n_1^{2/p} s \log\Big(\frac{e n_2}{s}\Big). \qquad (16)$$
These inequalities show that, for $0 < p \le 2$, the penalized least squares
estimator (12) achieves the rate of convergence given by Theorem 1. This implies
that this rate is minimax optimal.
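
The comparison between the $\|\cdot\|_{2,p}$ and $\|\cdot\|_2$ norms used to pass from Theorem 3 to Corollary 1 is Hölder's inequality applied to the vector of row norms. The short Python sketch below (function names are ours) evaluates the gap $n_1^{1/p - 1/2} \|A\|_2 - \|A\|_{2,p}$, which should be non-negative for every $0 < p \le 2$ and vanish when all rows have the same Euclidean norm.

```python
import math

def norm_2p(A, p):
    """l_p (quasi-)norm of the vector of row-wise Euclidean norms."""
    row_norms = [math.sqrt(sum(x * x for x in row)) for row in A]
    return sum(r ** p for r in row_norms) ** (1.0 / p)

def comparison_gap(A, p):
    """n1**(1/p - 1/2) * ||A||_2 - ||A||_{2,p}; non-negative for 0 < p <= 2."""
    n1 = len(A)
    return n1 ** (1.0 / p - 0.5) * norm_2p(A, 2) - norm_2p(A, p)
```

The factor $n_1^{1/p - 1/2}$, once squared and multiplied by the rate $n_1 s \log(e n_2/s)$ for $p = 2$, produces the factor $n_1^{2/p}$ appearing in the bounds.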

The next corollary shows that the estimator (12) also achieves the minimax
rate of convergence in a more general setting when $M \in \mathcal{A}(q,s)$ for $0 < q < 2$.
For any $0 < q < 2$ and $s > 0$ define the quantity

$$\psi(s) = n_1 s \Big[ K^2 \log\Big( 1 + \frac{K^q n_2}{s} \Big) \Big]^{1 - q/2} \wedge \big( n_1 s^{2/q} \big) \wedge \big( n_1 n_2 K^2 \big). \qquad (17)$$


Corollary 2. Let $\hat M$ be the penalized least squares estimator defined in (12)
with $\lambda = K_0 K^2$ where $K_0 > 0$ is large enough. Suppose that Assumption 1
holds and $M \in \mathcal{A}(q,s)$. Then, there exists a numerical constant $C^*$ such that for
any $\Delta > 0$

$$\|\hat M - M\|_2^2 \le C^* \psi(s) + \Delta$$
with probability at least 1 — 2 exp { — }, and 

$$\mathbb{E} \|\hat M - M\|_2^2 \le C^* \psi(s).$$


We give the proof of Corollary 2 in Appendix F. If the noise variables $\xi_{ij}$
are i.i.d. Gaussian $\mathcal{N}(0,\sigma^2)$, we have $\psi(s) = \eta(s)$. Thus, the rate of convergence
given by (11) is minimax optimal.


A Proof of Lemma 1 


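To prove Lemma 1 we use the Varshamov-Gilbert bound. The volume (cardinality)
$V_1$ of $\{0,1\}^{n_1 \times n_2}_{s}$ is

$$V_1 = \binom{n_2}{s}^{n_1}.$$

Note that the volume of the Hamming ball of radius $n_1(s+1)/2$ in $\{0,1\}^{n_1 \times n_2}_{s}$
is smaller than the volume $V_2$ of the Hamming ball of the same radius in the larger
space of all matrices $A = (a_{ij}) \in \mathbb{R}^{n_1 \times n_2}$ such that $a_{ij} \in \{0,1\}$ and $A$ contains
at most $n_1 s$ ones. Let $K = \lfloor n_1(s+1)/2 \rfloor$, where $\lfloor x \rfloor$ denotes the integer part of
$x$. A standard bound implies

$$V_2 \le \sum_{i=1}^{K} \binom{n_1 n_2}{i} \le \Big( \frac{e n_1 n_2}{K} \Big)^{K} \le \Big( \frac{2 e n_2}{s+1} \Big)^{n_1(s+1)/2},$$

where we use that $f(x) = x \log\big( \frac{e n_1 n_2}{x} \big)$ is increasing for $x \le n_1 n_2$.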











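In order to lower bound $V_1$ we use Stirling's formula (see, e.g., [10, p. 54]):
for any $j \in \mathbb{N}$

$$j! = j^{\,j+1/2} e^{-j} \sqrt{2\pi}\, \varphi(j) \quad \text{with} \quad e^{(12j+1)^{-1}} < \varphi(j) < e^{(12j)^{-1}}. \qquad (18)$$

Using (18) we get

$$V_1 \ge \left[ \frac{e^{-1/6}}{\sqrt{2\pi}} \cdot \frac{n_2^{\,n_2+1/2}}{s^{\,s+1/2}\, (n_2 - s)^{\,n_2 - s + 1/2}} \right]^{n_1}. \qquad (19)$$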

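Now, the Varshamov-Gilbert bound implies that there exists a subset $\Omega$ of
$\{0,1\}^{n_1 \times n_2}_{s}$ such that $d_H(A,A') > n_1(s+1)/2$ for any $A, A' \in \Omega$, $A \ne A'$, and

$$|\Omega| \ge \frac{V_1}{V_2} \ge \left[ \frac{e^{-1/6}\, n_2^{\,n_2+1/2}\, (s+1)^{(s+1)/2}}{\sqrt{2\pi}\, s^{\,s+1/2}\, (n_2 - s)^{\,n_2 - s + 1/2}\, (2 e n_2)^{(s+1)/2}} \right]^{n_1},$$

which implies

$$\log |\Omega| \ge n_1 \Big[ -\frac{1}{6} - \frac{1}{2} \log s - \log(\sqrt{2\pi}) + \Big( n_2 + \frac{1}{2} \Big) \log\Big( \frac{n_2}{s} \Big) + \frac{s+1}{2} \log(s+1) - \Big( n_2 - s + \frac{1}{2} \Big) \log\Big( \frac{n_2}{s} - 1 \Big) - \frac{s+1}{2} \log(2 e n_2) \Big]$$

$$\ge n_1 \Big[ -\frac{1}{6} - \frac{1}{2} \log s - \log(\sqrt{2\pi}) + s \log\Big( \frac{n_2}{s} - 1 \Big) - \frac{s+1}{2} \log\Big( \frac{2 e n_2}{s+1} \Big) \Big]. \qquad (20)$$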

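1) We first consider the case $501 \le s \le n_2/8$. Using that $\frac{s+1}{2} \le \frac{251 s}{501}$ for $s \ge 501$,
we get

$$\frac{s+1}{2} \log\Big( \frac{2 e n_2}{s+1} \Big) \le \frac{251 s}{501} \log\Big( \frac{501\, e\, n_2}{251\, s} \Big) \le \frac{98}{100}\, s \log\Big( \frac{n_2}{s} - 1 \Big),$$

where the last inequality is valid for $n_2/s \ge 8$.

On the other hand, it is easy to see that for $501 \le s \le n_2/4$ we have

$$\frac{1}{2} \log s \le 0.007\, s \log\Big( \frac{n_2}{s} - 1 \Big) \quad \text{and} \quad \frac{1}{6} + \log(\sqrt{2\pi}) \le 0.002\, s \log\Big( \frac{n_2}{s} - 1 \Big).$$

Then, (20) implies

$$\log |\Omega| \ge 0.011\, n_1 s \log\Big( \frac{n_2}{s} - 1 \Big) \ge 0.01\, n_1 s \log\Big( \frac{n_2}{s} \Big)$$

for $n_2/8 \ge s \ge 501$.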

2) Consider next the case s < 501 and s < 712 /8. Now, instead of the set 
{0,l}®^xra2 {O’linixi where I = [ 712 / 5 ]. Using the 

same arguments as above, we will show that there exists a subset C {0, 
such that duiA^A') > ni/2 for any A, A' & Cl, A ^ A' and log(cardfi) > 
Cni log (en 2 ). In this case, the previous values Ui and V 2 are replaced by 


Vi = r\ '^2 = £ r M < (2e/)”"/^ 

and 

log |ii| > ^ (2 log (0 - log (2e/)) > > lO-^Tiis log (^) 

for 5 < 501 and 772/5 > 8 . To embed Cl in {0, Ij^^xna define 

u = {Ae{o,iK^xn2 : A=(A_^,o), ieii, o g 

s times 


We have il C {0, Ij^^xnai card 17 = card 17 and dH[A,A') > ^ for any 

A,A' & A^ A'. 

3) In order to deal with the case 772/8 < s < nilddd define s' = and 

n 2 = n 2 — {s — s'). Then, 772 > 8s' and we can apply the previous result. This 
implies that there exists a subset 17 of {0, Ij^^xn^ ®cich that 


dH(A A-) 


for any A, A'd Vl, A ^ A' and 

/ p 7 ?' \ 10“"^ 

log(cardl7) > IO^'^tti s' log ( — ^ j > —^— 771 s 




where we used rials' > 772/s. 

To embed 17 in {0, l}^jxn 2 define 


17 = {A G {0,1} 


ni Xn2 




s — s' times 


We have 17 C (0, l}^^xn 2 ’ cardl7 = card 17 and dH{A,A') > ^ for 

any A,A'&^,A^ A'. 

Using exactly the same argument we can treat cases < s < 772/8 and 

772/3 < s < 772/2 to get the statement of Lemma 1. 
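
The Varshamov-Gilbert argument used in this proof is constructive in spirit: a greedy procedure that keeps adding points at Hamming distance larger than the prescribed radius from all previously selected ones produces a packing whose cardinality is at least the total number of points divided by the largest volume of a Hamming ball. The sketch below (plain Python, illustrative names, tiny dimensions chosen so that full enumeration is feasible) checks this on matrices with exactly $s$ ones per row. It illustrates the principle only and is not the argument of the paper, which bounds the two volumes analytically.

```python
from itertools import combinations, product

def all_row_sparse(n1, n2, s):
    """All n1 x n2 binary matrices (tuples of rows) with exactly s ones per row."""
    rows = [tuple(1 if j in support else 0 for j in range(n2))
            for support in combinations(range(n2), s)]
    return list(product(rows, repeat=n1))

def hamming(A, B):
    """Number of entries in which A and B differ."""
    return sum(a != b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def greedy_packing(points, radius):
    """Greedily select points with pairwise Hamming distance > radius."""
    chosen = []
    for P in points:
        if all(hamming(P, Q) > radius for Q in chosen):
            chosen.append(P)
    return chosen

def max_ball_volume(points, radius):
    """Largest number of points within Hamming distance <= radius of a center."""
    return max(sum(1 for Q in points if hamming(P, Q) <= radius) for P in points)

# Tiny illustrative instance (an arbitrary choice, not from the paper).
n1, n2, s = 2, 4, 1
radius = n1 * (s + 1) / 2
points = all_row_sparse(n1, n2, s)
packing = greedy_packing(points, radius)
ball = max_ball_volume(points, radius)
```

Because the greedy packing is maximal, the balls of the given radius centered at its elements cover the whole space, which gives `len(packing) * ball >= len(points)`.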










B Proof of Lemma 2 

Assume that $\mathrm{card}(J(A,A')) < \frac{n_1}{64}$. Then, denoting by $J^c(A,A')$ the complement
of $J(A,A')$ and using that $\mathrm{card}(J^c(A,A')) \le n_1$, we get

$$d_H(A,A') \le 2s\, \mathrm{card}(J(A,A')) + \frac{s}{32}\, \mathrm{card}(J^c(A,A')) < \frac{n_1 s}{32} + \frac{n_1 s}{32} = \frac{n_1 s}{16},$$

which contradicts the premise of the lemma.
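
The counting argument can also be tested empirically. The sketch below (plain Python; the helper names and the way the second matrix is generated are our own assumptions) represents each row by the set of positions of its ones, perturbs a random matrix row by row, and checks that whenever the total Hamming distance is at least $n_1 s/16$, at least $n_1/64$ rows differ in at least $s/32$ positions.

```python
import random

def row_distance(r, rp):
    """Hamming distance between two binary rows given by their supports."""
    return len(r ^ rp)  # size of the symmetric difference

def hamming(A, Ap):
    return sum(row_distance(r, rp) for r, rp in zip(A, Ap))

def far_rows(A, Ap, s):
    """Indices of the rows whose Hamming distance is at least s/32."""
    return [i for i, (r, rp) in enumerate(zip(A, Ap)) if row_distance(r, rp) >= s / 32]

def perturb_row(row, n2, t, rng):
    """Move t ones of the row to positions that were zero (keeps s ones)."""
    removed = rng.sample(sorted(row), t)
    added = rng.sample(sorted(set(range(n2)) - row), t)
    return (row - set(removed)) | set(added)

def check_lemma2(n1, n2, s, trials, rng):
    """Return False if some random pair violates the conclusion of the lemma."""
    for _ in range(trials):
        A = [set(rng.sample(range(n2), s)) for _ in range(n1)]
        Ap = [perturb_row(r, n2, rng.randint(0, 8), rng) for r in A]
        if hamming(A, Ap) >= n1 * s / 16 and len(far_rows(A, Ap, s)) < n1 / 64:
            return False
    return True
```

Such a test cannot replace the proof, but it is a cheap sanity check of the constants 16, 32 and 64.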

C Proof of Theorem 2. 

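It is enough to prove (i) since (ii) follows from (i) and the Markov inequality.

To prove (i) we use Theorem 2.5 in [21]. We define $k \ge 1$ to be the largest
integer satisfying

$$k \le s\, \sigma^{-q} \Big( \log\Big( 1 + \frac{n_2}{k} \Big) \Big)^{-q/2}. \qquad (21)$$

If there is no $k \ge 1$ satisfying (21), take $k = 0$. Set $\bar k = k \vee 1$ and $S = \bar k \wedge \frac{n_2}{2}$.
Let $\Omega' \subset \{0,1\}^{n_1 \times n_2}_{S}$ be the set given by Lemma 1. We consider

$$\Omega = \Big\{ \tau \Big( \frac{\delta}{S} \Big)^{1/q} A \;:\; A \in \Omega' \Big\}$$

where $0 < \tau < 1$ and $0 < \delta \le s$ will be chosen later. It is easy to see that
$\Omega \subset \mathcal{A}(q,s)$.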

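Since the noise variables $\xi_{ij}$ are i.i.d. Gaussian $\mathcal{N}(0,\sigma^2)$, for any two distinct
$B, B'$ in $\Omega$, the Kullback-Leibler divergence $KL(\mathbb{P}_B, \mathbb{P}_{B'})$ between $\mathbb{P}_B$ and $\mathbb{P}_{B'}$
is given by

$$KL(\mathbb{P}_B, \mathbb{P}_{B'}) = \frac{\|B - B'\|_2^2}{2\sigma^2}. \qquad (22)$$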

We consider now three cases, depending on the value of the integer $k$ defined
in (21).

Case (1): $k = 0$. Since $k = 0$, the inequality (21) is violated for $k = 1$, so
that

$$s < \sigma^{q} \big( \log(1 + n_2) \big)^{q/2}. \qquad (23)$$

Here $S = 1$ and we take $\delta = s$. We have that for any two distinct $B, B'$ in $\Omega$,

\\B - (,f-^ (24) 

On the other hand, by Lemma 1, we have that

$$\log |\Omega| \ge C n_1 \log(1 + n_2)$$

and using (23)

$$KL(\mathbb{P}_B, \mathbb{P}_{B'}) = \frac{1}{2\sigma^2} \|B - B'\|_2^2 \le \frac{\tau^2 n_1 s^{2/q}}{\sigma^2} \le \tau^2 n_1 \log(1 + n_2) \le \alpha \log |\Omega| \qquad (25)$$

for some $0 < \alpha < 1/8$ if $0 < \tau < 1$ is chosen sufficiently small.

Case (2): I < k < nij^. We take E = For any two distinct B, B' in 




\B-Br2> 


> 


/||2 ^ niT^{S+l) 


> 


> 


niT 


niT 


niT 


(f)' 

(s)2/«(scr“« + 


1 - 2/9 


• s a 


2-q 


(log ( 


1 + 


712 


J) 


1 - 9/2 


( 26 ) 


sa^ ‘‘ (log (1 + 772 s 


1 - 9/2 


By Lemma 1, we have that 

log |f2| >CniS log (l + 

> s a~‘^ (log (l + 772 cr'^)) 


1 - 9/2 


and 


KL(Pb,PbO = —I|s-s'll2< 


/|,2 ^ T ni ^ 2/9 <-, 1 - 2/9 


2 cr^ 
^2 , 


< (scr (log (1 + 772 5 ^cr'^)) 

< 77l 0-“'^ (log (1 + 772 s“^ cr'^))^ 

< a log |n| 


-9/2N 1-2/9 


( 27 ) 


for some 0 <Q;<l/ 8 if 0 <r<lis chosen sufficiently small. 

Case (3)-. fc> 772 / 2 . Since k > 772 / 2 , the inequality (21) is violated for 
k = so that 

(28) 


In this case S = 772/2 and, using (28), we can take <5 = . We have that for 

any two distinct B, B' in 17, 




18 


( 29 ) 
On the other hand, by Lemma 1, we have that

$$\log |\Omega| \ge C n_1 n_2$$


and 


KL{Fb,Vb') = ^\\B- B'Wl < 

< a log |0| 


( 30 ) 


for some 0<a<l/8if0<r<lis chosen sufficiently small. 

Now the statement of Theorem 2 follows from (24)-(25), (26)-(27),
(29)-(30) and Theorem 2.5 in [21].

D Proof of Theorem 3. 


This proof essentially follows the scheme suggested in [4], adding an extension
to the case of sub-Gaussian noise. Let $A \in \mathbb{R}^{n_1 \times n_2}$ be a fixed, but arbitrary
matrix. Define for all $1 \le r \le n_1 n_2$


B^ = [A = A'-Ag 


Xn2 


■■ mo = r}. 


Let {Jfe}, /c = 1,..., be all the sets of matrix indices {i,j) of cardinality 

r. Define 


Br,k = {.4 = (oy ) £ Br : a'ij ^ 0 


(i,j) £ Jfc} 


where a\j = Oy + Oy. We have that dim(Sr,fe) < r. Let denote the 

projection of the matrix B onto Br k and pen(A) = A||^||o log ( ). By 

the definition of M, for any A £ 

||y - M\\l + pen(M) < ||y - A||2 + pen(A). 

Rewriting this inequality yields 

||M - M||^ + pen(M) < ||M - A\\l + 2 E ^..(M - A)y + pen(A) 

do) 

< ||M - AI||2 + 2 I I 11^ - ^Il2 + Pen(7l). 




For B = (bij) £ 
1 


we set V{B) = a > 1 

(iJ) 


1 


1 - ^ 1 ||M - M\\i + pen(M) < ( 1 + - ) ||M - + 2aV\M - A) + pen(A). 


(31) 
Next, since $\mathbb{R}^{n_1 \times n_2} = \bigcup_{r=0}^{n_1 n_2} \bigcup_{k=1}^{\binom{n_1 n_2}{r}} B_{r,k}$, we obtain


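$$2aV^2(\hat M - A) - \mathrm{pen}(\hat M) \le \max_{0 \le r \le n_1 n_2}\ \max_{1 \le k \le \binom{n_1 n_2}{r}}\ \max_{\bar A \in B_{r,k}} \big[ 2aV^2(\bar A) - \mathrm{pen}(\bar A + A) \big].$$

Note that for $r = 0$ we have that $B_0 = \{-A\}$ and

$$2aV^2(-A) - \mathrm{pen}(-A + A) = 2aV^2(A).$$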

Let $J_A$ denote the sparsity pattern of $A = (a_{ij})$, i.e.

$$J_A = \{ (i,j) \in \mathcal{N}_{n_1 \times n_2} : a_{ij} \ne 0 \};$$

then for any $\bar A \in B_{r,k}$



This together with (31) imply 




H-max max 

a — 1 l<r<nin2 0<fc<("'^”^) 


(32) 


By Assumption 1, the errors are sub-Gaussian. We will use the following tail
bounds in order to control the last term in (32).

Lemma 3. Let Assumption 1 be satisfied. Then, there exist absolute constants
$C_0, C_1, C_2, C_3 > 0$ such that for $K_1 = K_0 K^2$ with $K_0 > 0$ large enough


P max max 

l<r<nin2 0<fc<("i^'*2 



(33) 


E max max 
l<r<mn2 0<fc<("i"2) 


max (lln^ fc(L ;)||2 - ATirlog < cq (34) 

L V r / J 


and 


P[F2(A)-iLi|lA|lo> A] <2exp 



(35) 
Now (14) follows from Lemma 3 and (32).

To prove (13), note that by Lemma 3 and (32), for A = 2a Kq there exist 
numerical constants C, Ci, C 2 >0 such that 


a T 1 1 


P inf ^||M-Al||^ + C||Al||olog 

V I a — 1 


e ni 712 


2 a^ 

+--A 

a — 1 




max max {\\^r,k{E)\\l - Kirlog 
l<r<nin 2 L \ r / J 


> a /2 


+ P(l/2(yi)-LCi||yl||o> A/2) 


< Cl exp -C 2 ^ 


which proves (13). 


E Proof of Lemma 3 

We have that 


def, 
PA =1 


1<T 


/eni n2\ 

} > A 

\ r / 

J 


nin2 ( A ^ ) 

E : 

r=l 

711712 / 

77-1712 


^E 




•=i 


Ilr,k{E)\\l > A + iLirlog 
Zr.>A + Kir log - 2rK 


where ~ ^i,..., are i.i.d. random variables satisfying 

Assumption 1. Note that are sub-exponential random variables with < 

2 K^. Applying Bernstein-type inequality (see, e.g.. Proposition 5.16 in [24]) and 

using that i^e get 

PA < 2 ’g ("'/") exp {-a (a'op log + ^])| 

E(^)bxp{-aA.P,og(£:^^)}. 


= 2 exp 


C 2 A 


r—1 


Taking Kq large enough we get 

PA < 2 exp|-^^ j ^exp{-rlog 2 } < Ciexpj- 


r—1 


C 2 A 
This proves (33) and easily implies the bound in expectation (34).
To prove (35), we apply a Bernstein-type inequality to $V^2(A) =$


E ^fj-^{^!j)>Ki\\A\\o-2\\A\\oK^ + A 

< exp|-C 2 (kq Iloilo - ||A||o -t 1 < 2exp|-*^^^ 


2K^ 




F Proof of Corollary 2. 

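We use Theorem 3. First, taking $A = 0$ in (13), we get

$$\|\hat M - M\|_2^2 \le \frac{a+1}{a-1} \|M\|_2^2 + \frac{2a^2}{a-1} \Delta \le \frac{a+1}{a-1}\, n_1 s^{2/q} + \frac{2a^2}{a-1} \Delta \qquad (36)$$

with probability at least $1 - 2 \exp\{ -C_0 \Delta / K^2 \}$.

Now, choosing $A = M$, we obtain that

$$\|\hat M - M\|_2^2 \le C K^2 \|M\|_0 \log\Big( \frac{e n_1 n_2}{\|M\|_0 \vee 1} \Big) + \frac{2a^2}{a-1} \Delta \le C K^2 n_1 n_2 + \frac{2a^2}{a-1} \Delta \qquad (37)$$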


with probability at least 1 — 2 exp { — }. 

Finally, Theorem 3 implies that for any 1 < s' < 712 / 2 , all a > 1 and any 
A > 0 


\\M-M\\l< inf J^^\\M-A\\l+CKurils'log(l +A (38) 
A€A{2s')a—l \ 2 s'/ a—1 


772 \ 2a^ 


with probability at least 1 — 2 exp { —Now we use the following lemma. 

Lemma 4. Let $1 \le s' \le n_2/2$ and $0 < q < 2$. For any $M \in \mathcal{A}(q,s)$, there exists
$A \in \mathcal{A}(2s')$ such that

$$\|M - A\|_2^2 \le s^{2/q} (s')^{1 - 2/q}\, n_1. \qquad (39)$$

For the proof of this lemma, see Lemma 7.2 in [22] (case $0 < q \le 1$) and the
proof of Lemma 7.4 in [22] (case $1 < q < 2$).
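
A natural candidate for the approximating matrix in Lemma 4 keeps, in each row, the $2s'$ entries that are largest in absolute value and sets the rest to zero. We do not claim that this is the construction used in [22]; it is an assumption made for the sketch below. For a row $v$ with $\sum_j |v_j|^q \le s$, the $m$-th largest absolute value is at most $(s/m)^{1/q}$, so the squared tail beyond the $m$ largest entries is at most $s^{2/q} m^{1-2/q}$, which for $m = 2s'$ does not exceed $s^{2/q} (s')^{1-2/q}$. The Python code (with hypothetical names and an arbitrary example matrix) checks this numerically.

```python
def truncate_rows(M, m):
    """Keep the m largest-magnitude entries of each row, zero out the rest."""
    A = []
    for row in M:
        order = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)
        keep = set(order[:m])
        A.append([x if j in keep else 0.0 for j, x in enumerate(row)])
    return A

def squared_error(M, A):
    """Squared elementwise l_2 distance between M and A."""
    return sum((x - y) ** 2 for rm, ra in zip(M, A) for x, y in zip(rm, ra))

def lemma4_bound(n1, s, s_prime, q):
    """Right-hand side s^(2/q) * (s')^(1 - 2/q) * n1 of the approximation bound."""
    return n1 * s ** (2.0 / q) * s_prime ** (1.0 - 2.0 / q)

# Arbitrary example: both rows have l_1 norm at most s = 2 (so q = 1).
M = [[1.0, 0.5, 0.25, 0.25], [0.4, 0.4, 0.4, 0.4]]
q, s, s_prime = 1.0, 2.0, 1
A = truncate_rows(M, 2 * s_prime)
err = squared_error(M, A)
```

In this example the truncation error is far below the bound, which is expected since the bound is a worst case over the whole class.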

Now, (38) and Lemma 4 imply that for any 1 < s' < 772/2 

\\M - M\\l < C 771 s' log (1 + y) + (s')^"2/%i -1- a) . (40) 
The terms depending on $s'$ on the right side of (40) are balanced by choosing


c^{\og{l + n2K'^s 1 )) 
with suitable constant c' > 0. With this choice of s we get 


\\M-M\\l < C msK' 


(^log 


1 + n2 




i-g/2 


(41) 


The inequalities (36), (37) and (41) imply the statement of Corollary 2.